Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize repeated pointer accesses in op/base reduction functions #9717

Closed
wants to merge 1 commit into from
Closed

Conversation

gkatev
Copy link
Contributor

@gkatev gkatev commented Dec 1, 2021

Hi, this PR changes some repeated pointer accesses in the reduction function range checks, so that they are only performed (the pointer accesses) once. (Hope I didn't miss any, are there any reduction functions in another file?)

I wouldn't expect this change to make a difference, and I probably would expect the compiler to optimize the original code on its own, but I have experimentally observed increased performance with this change. Might well not be the case for a different system or compiler, but the changes shouldn't really have any downside.

Allreduce performance (latency) before patch
10 runs: Average 6374.9 us, Stddev 30.2834 us, Min 6327.88 us, Max 6420.74 us

Allreduce performance (latency) after patch
10 runs: Average 5776.3 us, Stddev 41.5984 us, Min 5705.17 us, Max 5828.30 us

Setup

osu_allreduce_cc @ 1M (data-altering osu_allreduce variant)
coll: libnbc,basic
pml: ob1
btl: vader,self (xpmem)
bind-to: core
map-to: slot

CentOS 8, GCC 8.5.0, aarch64 (single node)
--enable-debug not specified during config

Disclaimer: Tested on v5.0.0rc2

Signed-off-by: George Katevenis gkatev@ics.forth.gr

Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@jsquyres
Copy link
Member

jsquyres commented Dec 1, 2021

ok to test

@jsquyres
Copy link
Member

jsquyres commented Dec 1, 2021

I'm a little surprised that we left that pointer-dereference assignment in all those loops, but I guess it just got missed.

I double checked MPI-4.0 to make sure that the len argument isn't an INOUT, and it doesn't appear to be. MPI-4.0 p235:13-16 says:

The len argument ... is passed in as a reference for reasons of compatibility with Fortran.

@bosilca Can you verify?

@ggouaillardet
Copy link
Contributor

@jsquyres the Git commit checker did not run (even if you issue the bot command`) because this is a new contributor (!), so I had to manually hit a 'Approve and Run' button (iirc) in the web interface.

@ggouaillardet
Copy link
Contributor

@gkatev thanks for the patch!

Out of curiosity, did you make any performance measurements (with Open MPI compiled for production - e.g. -O2 or more aggressive)?
Though the patch looks correct at first glance, I would naively suspect the compiler optimize these pointer dereferences, ending up generating the same code with or without this patch.

@devreal
Copy link
Contributor

devreal commented Dec 1, 2021

This might be an aliasing issue: the compiler cannot assume that count isn't aliased by another pointer, unless it's marked as restrict. Hence the repeated load from memory.

@gkatev
Copy link
Contributor Author

gkatev commented Dec 1, 2021

@ggouaillardet I think I am not seeing a difference with O2/O3, but am I doing it correctly like this?
CFLAGS=-O2 make -C ompi
Or is there a config-time or otherwise way to tune the optimization level?

@awlauria
Copy link
Contributor

awlauria commented Dec 1, 2021

@gkatev I believe you'll need to pass in CFLAGS at configure time for it to have any effect.

you can check with make -V2 to see it got passed in.

@awlauria
Copy link
Contributor

awlauria commented Dec 1, 2021

FWIW in past projects I have seen these little optimizations making a difference.

Compilers should be smart enough, but can't always rely on that.

@bosilca
Copy link
Member

bosilca commented Dec 1, 2021

Check the assembly at godbolt for 4 different versions of the sum int32_t. Depending on the compiler, the generated code is slightly different. clang seems to assume the pointer to count is restrict and always generates the same optimized code, while gcc is taking a very cautious approach. In general the difference is reloading the *count from the cache instead of working on it from a register.

Conclusion, to remove the compiler from the equation and always be sure the code is as optimal as possible either we add restrict or we do the trick proposed here (temporary variable).

@bosilca
Copy link
Member

bosilca commented Dec 1, 2021

However, I think I would rather have the code with the following pattern:

void
ompi_op_base_2buff_sum_int32_t(const void *in, void *out, int *count)
  {
      int i;
      int32_t *a = (int32_t *) in;
      int32_t *b = (int32_t *) out;
      for (i = *count; i > 0; i--) {
          *(b++) += *(a++);
      }
  }

@jsquyres
Copy link
Member

jsquyres commented Dec 1, 2021

@bosilca Just curious: why keep it incrementing through a pointer dereference?

@bosilca
Copy link
Member

bosilca commented Dec 1, 2021

Not sure I understand the question. Are we talking about count or about the other pointers ?

@jsquyres
Copy link
Member

jsquyres commented Dec 1, 2021

@bosilca You're right... my question was non-sensical. Disregard!

@devreal
Copy link
Contributor

devreal commented Dec 2, 2021

I thought the point of @bosilca 's comment was to walk backwards (i--) instead of forward (i++), which may be more cache efficient (assuming that an application wrote the last bytes last and will access again from the first bytes)

@devreal
Copy link
Contributor

devreal commented Dec 2, 2021

Ahh never mind, the pointers are still walking forward 🤦

bosilca added a commit to bosilca/ompi that referenced this pull request Dec 2, 2021
Remove all ops with 3 buffers, we ended up not using them anywhere in
the code.
Change the loop order in the base MPI_Op to allow for more
optimizations, as discussed in open-mpi#9717.

Fixes open-mpi#9717.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@bosilca
Copy link
Member

bosilca commented Dec 2, 2021

I created #9719 as an alternative to this. While digging in the code I also noticed we are not using the 3buff MPI_Op so I took the opportunity to remove all support.

bosilca added a commit to bosilca/ompi that referenced this pull request Dec 2, 2021
Remove all ops with 3 buffers, we ended up not using them anywhere in
the code.
Change the loop order in the base MPI_Op to allow for more
optimizations, as discussed in open-mpi#9717.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
bosilca added a commit to bosilca/ompi that referenced this pull request Dec 2, 2021
optimizations, as discussed in open-mpi#9717.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@gkatev
Copy link
Contributor Author

gkatev commented Dec 2, 2021

FYI, for curiosity's sake I checked with make V=2, and it appears that all commands involving op_base_functions.c do contain -O3. (by default, without explicitely asking for it during configuration/compilation) -- and therefore all my tests here should have contained it.

@ggouaillardet
Copy link
Contributor

@devreal you are absolutely right about pointer aliasing! and that makes the patch necessary to improve performance when the compiler chooses to generate correct code.

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Jan 23, 2022
optimizations, as discussed in open-mpi#9717.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 1215776)
@jsquyres
Copy link
Member

We merged #9719 and #9867 proposes removing the 3-buf variants. Should this PR now be closed?

@gkatev gkatev closed this Jan 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants